MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis

Abstract

Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous works focus on predicting style embeddings at a single scale from information within the current sentence. However, context information in neighboring sentences and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis, to capture and predict styles at different levels from a wider range of context rather than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2 based acoustic model. The predictor is designed to explore the hierarchical context information by considering structural relationships in context and to predict style embeddings at the global level, sentence level, and subword level. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Evaluations on both in-domain and out-of-domain audiobook datasets demonstrate that the proposed method significantly outperforms three baselines. In addition, we conduct an analysis of the context information and multi-scale style representations, which have never been discussed before.

Similar resources

Expressive Speech Synthesis and Modeling

As human beings we communicate with each other through our feelings, which are expressions shaped by the experience and knowledge we have. Since every single state of humans can be related to a particular emotion, the role of emotions in communication cannot be underestimated. There have been studies of the human brain showing the impossibility to make appropriate decisions when the emotion-con...

Uncovering Latent Style Factors for Expressive Speech Synthesis

Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of “style tokens” in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We ...

Hierarchical stress generation with Fujisaki model in expressive speech synthesis

This paper introduces a hierarchical stress generation method for expressive speech synthesis. In the previous study, we proposed a novel hierarchical Mandarin stress modeling method, and the text-based stress prediction experiments demonstrated that a reliable stress assignment can be obtained from textual features. However, the stress model should be further verified to be an effective and efficient pros...

Modeling the prosody of Vietnamese attitudes for expressive speech synthesis

Attitudes or social affects are strongly implied in interaction processing, and specifically to socio-cultural aspects of language. This paper presents the modeling of attitude to apply in expressive speech synthesis in Vietnamese, an under-resourced tonal language. A prosodic model for Vietnamese attitude is proposed based on the concept of “rendez-vous” between linguistic levels and prosodic ...

Multi-level Exemplar-Based Duration Generation for Expressive Speech Synthesis

The generation of duration of speech units from linguistic information, as one component of a prosody model, is considered to be a requirement for natural sounding speech synthesis. This paper investigates the use of a multi-level exemplar-based model for duration generation for the purposes of expressive speech synthesis. The multi-level exemplar-based model has been proposed in the literature...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2023

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2023.3301217